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Q,' Abstract A new decomposition optimization algorithm, called path-following gradient- 

QJ | based decomposition, is proposed to solve separable convex optimization problems. Unlike 

path-following Newton methods considered in the literature, this algorithm does not requires 

f- ^ ' any smoothness assumption on the objective function. This allows us to handle more gen- 

rvq , eral classes of problems arising in many real applications than in the path-following Newton 

methods. The new algorithm is a combination of three techniques, namely smoothing, La- 

grangian decomposition and path-following gradient framework. The algorithm decomposes 

the original problem into smaller subproblems by using dual decomposition and smooth- 



u 

ing via self-concordant barriers, updates the dual variables using a path-following gradient 



method and allows one to solve the subproblem in parallel. Moreover, the algorithmic pa- 
rameters are updated automatically without any tuning strategy as in augmented Lagrangian 
approaches. We prove the global convergence of the new algorithm and analyze its local 
convergence rate. Then, we modify the proposed algorithm by applying Nesterov's accel- 
erating scheme to get a new variant which has a better local convergence rate. Finally, we 
fv; present preliminary numerical tests that confirm the theory development. 

Keywords Path-following gradient method • dual fast gradient algorithm • separable convex 
optimization ■ smoothing technique ■ self-concordant barrier ■ parallel implementation. 



1 Introduction 



o 

Many optimization problems arising in engineering and economics can conveniently be for- 
mulated as Separable Convex Programming Problems (SepCPs). Particularly, optimization 
problems related to a network jY(V,E) of M agents, where V denotes the set of nodes and 
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E denotes the set of edges in the network, can be cast into separable convex optimization 
problems. Several applications can be found in the literature such as distributed control, 
network utility maximization, resource allocation, machine learning and multistage stochas- 
tic convex programming [1.2 17 21 22 1. Problems of moderate size or possessing a sparse 
structure can be solved by standard optimization methods in a centralized setup. However, 
in many real applications we meet problems, which may not be suitable to solve by standard 
optimization approaches or exploiting problem structures, e.g. nonsmooth separate objective 
functions, dynamic structure or distributed information. In those situations, decomposition 
methods can be considered as an appropriate framework to tackle those problems. Particu- 
larly, Lagrangian dual decomposition techniques are widely used to decompose a large-scale 
separable convex optimization problem into smaller subproblem components, which can si- 
multaneously be solved in a parallel manner or in a closed form. 

Various approaches have been proposed to solve (SepCP) in decomposition framework. 
One class of algorithms is based on Lagrangian relaxation and subgradient-type methods of 
multipliers fTT|5| [T2l . It has been observed that subgradient methods are usually slow and nu- 
merically sensitive to the choice of step sizes in practice 1131 . The second approach relies on 
augmented Lagrangian functions, see e.g. 161 1181 . Many variants were proposed to process 
the inseparability of the crossproduct terms in the augmented Lagrangian function in differ- 
ent ways. Another research direction is based on alternating direction methods which were 
studied, for example, in |2]|7). Alternatively, proximal point-type methods were extended to 
the decomposition framework, see, e.g. [3 10). Other researchers employed interior point 
methods in the framework of decomposition such as ll8l llll[T9ll22l . 

In this paper, we follow the same line of the dual decomposition framework but in 
a different way. First, we smooth the dual function by using self-concordant barriers. By 
an appropriate choice of the smoothness parameter, we show that the dual function of the 
smoothed problem is an approximation of the original dual function. Then, we develop a 
new path-following gradient method for solving the smoothed dual problem. By strong du- 
ality, we can also recover an approximate solution for the original problem. Compared to the 
previous related methods mentioned above, the new approach has the following advantages. 
First, since a self-concordant barrier function only depends on its barrier parameter, this al- 
lows us to avoid a dependency on the diameter of the feasible set as in prox-function smooth- 
ing techniques I10II19I . Second, the proposed method is a gradient-type scheme which al- 
lows to handle more general classes of problems than in path- following Newton methods (9] 
[T91I221 . in particular the nonsmoothness of the objective function. Third, by smoothing via 
self-concordant barrier functions, if the objective function is smooth then instead of solving 
the primal subproblems as general convex programs we can treat them by using optimality 
conditions which are equivalent to solving nonlinear systems. Finally, by convergence anal- 
ysis, we provide an adaptive update for all the algorithmic parameters which still ensure the 
convergence of the new methods. 

Contribution. The contribution of the paper can be summarized as follows: 

(a) We propose to use a smoothing technique via barrier function to smooth the dual func- 
tion of (SepCP) as in (U|9f22]. However, we provide a new estimate for the dual func- 
tion, see Lemma [T] 

(b) We propose a new path-following gradient-based decomposition algorithm, Algorithm 
[TJ to solve (SepCP). This algorithm allows one to solve the subproblem of each com- 
ponent in parallel. Moreover, all the algorithmic parameters are updated automatically 
without using any tuning strategy. 

(c) We prove the convergence of the algorithm and estimate its local convergence rate. 
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(d) We modify the algorithm by applying Nesterov's accelerating scheme to obtain a new 
variant, Algorithmic] which possesses a convergence rate, i.e. 0(l/e), where e is a given 
accuracy. 

Let us emphasize the following points. The new estimate of the dual function considered 
in this paper is different from the one in 11191 which does not depend on the diameter of 
the feasible set of the dual problem. The worst case complexity of the second algorithm is 
0(l/e) which is much higher than in subgradient-type methods of multipliers IT1|5| I121 . We 
notice that this convergence rate is optimal in the sense of Nesterov's optimal schemes 1131 
[l4l for this class of algorithms. Moreover, we can choose smoothness parameter to adjust 
this convergence rate. All the algorithms can be implemented in a parallel manner. 

Outline. The rest of this paper is organized as follows. In the next section, we state the 
problem formulation and review the Lagrangian dual decomposition framework. Section 
[3] considers a smoothing technique via self-concordant barriers and provides an estimate 
for the dual function. The new algorithms and their convergence analysis are presented in 
Sections [4] and [5] Preliminarily numerical results are shown in the last section to verify our 
theoretical results. 

Notation and Terminology. Throughout the paper, we work on the Euclidean space W 
endowed with an inner product x T y for x,y 6 M." and the norm ||a||2 := Vx T x. For a proper, 
lower semi-continuous convex function /, df(x) denotes the subdifferential of / at x. If 
/ is concave then we also use df(x) for its super-differential at x. For any x 6 dom(/) 
such that V 2 f(x) is positive definite, the local norm of a vector u with respect to / at x is 
defined as ||«|| A := [u T V 2 f(x)u] and its dual norm is ||m||* := max{w r v | ||v|| v < 1} = 

[u T V 2 f(x)~ l u\ . It is obvious that u T v < ||m||a|M|.*- The notation E + and R ++ define the 
sets of nonnegative and positive numbers, respectively. The function CO : R + — > R. is defined 
by co(t) :=?-ln(l +t) and its dual function ffl* : [0, 1) -> R is 0)*(f) := -f-ln(l -t). 



2 Separable Convex Programming Problems and Lagrangian Dual Decomposition 

A Separable Convex Programming problem (SepCP) is typically written as follows: 



max (j)(x) := Y<t>i(xj), 



* (SepCP) 

s.t. £(A,*,-Z>,)=0, 
;=i 
XieXi, /=!,■■■ ,N, 



where the decision variable x := {x\,... ,Xjv) with xj 6 K" 1 , the function 0,- : R"< — > M. is 
concave and the feasible set is described by the set X := X\ x ■ ■ • x X/y, with Xj 6 W' 
nonempty, closed, convex sets for all i= l,--- ,N. Matrix A := [Ai,...,Ajv], with A,- e R mxn ' 

for i = 1, . . . ,N, b := Y^=\ h £R" and nH \-n N = n. The constraint Ax - b = in 

([SepCPl is called coupling linear constraint, while x, 6 Xj is referred to as local constraints 
of the i'-th component (agent). 

Let Jf(x,y) := (j)(x) +y T (Ax — b) be the partial Lagrangian function associated with the 
coupling constraint Ax — b = of JSepCPl. The dual problem of dSepCPl is written as: 

g*:=ming(y), (1) 
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where g is the dual function defined by: 

g(y) •.= max3f(x,y)=max.{Q(x)+f(Ax-b)}. (2) 

Due to the separability of (j), the dual function g can be computed in parallel as: 

N 

gM = I>M, H(y) ■-max{<l) i (x i )+y T (A i x i -b i )},i=l,...,N. (3) 

Throughout this paper, we make the following assumptions: 

Assumption A.1 The following assumptions hold, see M8V : 

(a) The solution set X* of dSepCPl is nonempty. 

(b) Either X is polyhedral or the following Slater qualification condition holds: 

ri{X)n{x\Ax-b = 0}^®, (4) 

where h(X) is the relative interior ofX. 

(c) The functions 0,-, /' = l,...,N, are proper, upper semicontinuous and concave and A is 
full-row rank. 

Assumption Ajl]is standard in convex optimization. Under this assumption, strong duality 
holds, i.e. the dual problem (0 is also solvable and g* = (j>*. Moreover, the set of Lagrange 
multipliers, Y* , is bounded. However, under Assumption AjTJ the dual function g may not 
be differentiable. Numerical methods such as subgradient-type and bundle methods can be 
used to solve (Q. Nevertheless, these methods are in general numerically intractable and 
slow. 



3 Smoothing via self-concordant barrier functions 

In many practical problems, the feasible sets X,, i = 1, . . .,N are usually simple, e.g. box, 
polyhedra and ball. Hence, Xj can be endowed with a self-concordant barrier (see, e.g. 1161 
[T31 ) as in the following assumption. 

Assumption \2 Each feasible set Xj, i = 1 , . . . , N, is bounded and endowed with a self- 
concordant barrier function F; with the parameter V; > 0. 

Note that the assumption on the boundedness of Xj can be removed by assuming that the set 
of sample points generated by the new algorithm described below is bounded. 

Remark 1 The theory developed in this paper can be easily extended to the case X, given as 
follows (see 11511 ) for some i 6 {1, • ■ ■ ,N}: 

X t := Xf C\Xf, X? := {x t e R"' : D iXi = d t } , (5) 

by applying the standard linear algebra routines, where the set X? has nonempty interior and 
associated with a v,-self-concordant barrier F,\ 
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Let us denote by x\ the analytic center of X,-, i.e.: 

x\ := arg min Ffa) Vi = 1, . .. ,N, (6) 

j:,Gint(X,) 

where int(X,) is the interior of X,. Since X, is bounded, x? is well-defined 1131 . Moreover, 
the following estimates hold: 

F i {xi)-F i {4)>&{\\xi-4\\ x c)m6.\\x i -4\\ y ! i <V i + 2^V u \/ Xi eX h i=l,...,N. (7) 

Without loss of generality, we can assume that Fi(xf) = 0. Otherwise, we can replace F, by 
//(•) := Fj(-) —Fi(xj) for i = l,...,N. Since X is separable, F := YJiLi Pi ls a self-concordant 
barrier of X with the parameter v := J^lj V,. 
Let us define the following function:: 

JV 

g(y;t):=£ 8i (y;t), (8) 



where 



&(y,t):= max {^(^) +y T (A iXi -bi) -tFifa)} , i= 1,...,N, (9) 

. t ,6int(X,) 



with t > being referred to as a smoothness parameter. Note that the maximum problem in 
^ has a unique optimal solution, which is denoted by x*(y\i), due to the strict concavity 
of the objective function. We call this problem the primal subproblem. Consequently, the 
functions gi(-,t) and g(-,t) are well-defined and smooth on R™ for any t > 0. We call gi(-',t) 
and g(-;t) the smoothed dual function of g,- and g, respectively. 
The optimality condition for (|9]l is: 

OedUxtiy-J^+Afy-tVF^iy-j)), i=\,---,N. (10) 

Let us define the full optimal solution x* (y;t) := (x\(y;t), ■ ■ ■ ,jc^,(y;f)). Since problem (|9]l is 
convex, this condition d 10b is necessary and sufficient for optimality. Moreover, the gradients 
ofg,(-;f) and g(-;t) are given by: 

Vg i (y,t)=A iX *(y;t)-b i , V 'g(y;t) = Ax* (y;t) - b. (11) 

If ft is differentiable for some i G { 1 , • • • ,N} then the condition d 10b collapses to V0,- (x* (y; t )) + 
Ajy — tVFj(x*(y;t)) = 0, which is indeed a system of nonlinear equations. First, we prove 
that g(-;t) is an approximation of the dual function g(-) for sufficiently small t > 0. 

Lemma 1 Suppose that Assumptions A^and Af2\are satisfied. Let x be a strictly feasible 
point to ) |SepCP[ l, i.e. x 6 int(X) PI {x \ Ax = b}. Then, for any t > 0, we have: 

g(y)-(j)(x)>0 and g(y;t) -(j)(x) +tF(x) > 0. (12) 

Moreover, it holds that: 

g(y,t) < g(y) < g(y;t) +t(v +F(x)) + 2V^[g(y,t) +tF(.x) - <p(x)f 2 . (13) 
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Proof The first two inequalities in (1 12b are trivial due to the definitions of g(-), g{-;t) and 
the feasibility of x. We only prove dl3b . Indeed, since x 6 int(X) and x* (y) € X, if we define 
jt*(y) :=x+ z{x*(y) —x), then x T (y) € intX if t € [0, 1). By applying the inequality [16] 
2.3.3] we have: 

F(* T (y))<F(x)-vln(l-T). 

Using this inequality together with the definition of g(-',t), the concavity of (j) and Ax = b, 
we deduce: 

g(y\t)= max {<j>(x)+y T (Ax-b)-tF(x}} 

x&at(X) 

> max { <t> {x t (y) ) + y T {Ax x {y)-b) -tF(x x (y))} (14) 

T€[0,1) 

> max {(l-T)^(x) + Tg(y)+?vln(l-T)}-fF(jf). 

T6[0,l) 

By solving the maximization problem on the right hand side of d 14b and then rearranging 
the results, we obtain: 

g (y)<g(yU)+t[v + F(x)\+tv[ln( K 8{y) ~J ){ -" ) )] + , (15) 

where [-] + := max {■,()}. Moreover, it follows from (1141 that: 

g(y) -<j)(x) < I [g(y;t) - <j>(x) +tF(x) +fvln(l + ^)] 



If we minimize the right hand side of this inequality in [0, 1), then we get g(y) — (j)(x) < 
[(g(y;t) — <l>{x) +tF(x)) ' + \/tv] 2 . Finally, we plug this inequality into fl!5b to obtain: 



<g{y;t)+tv+tF(x)+2Vtv[g(y;t) -(j)(x) +tF{x)] l/2 , 

which is indeed dl 3b . □ 

Remark 2 (Approximation of g(y)) It follows from d 1 3b that g(y) < (1 + 2\/tv)g(y;t) + 
t(v + F(x)) + 2^tv(tF(x) — <p(x)). Hence, g(y,t) —> g{y) as t — > + . Moreover, this estimate 
is different from the one in 1191 . since we do not assume that Y is bounded. 

Next, we consider the following minimization problem, called smoothed dual problem: 

g*(t) :=g(y*(t);t) = mm g(y;t). (16) 

We denote by y*(t) the solution of ( 116b . The following lemma shows the main properties of 
the functions g(y; •) and g* (•). 

Lemma 2 Suppose that Assumptions A^and A^\are satisfied. Then: 

(a) The function g{y',-) is convex and nonincreasing in M++ for a given y 6 K m . Moreover, 
we have: 

g(y;i)>g(y,t)-(t-t)F(x*(y;t)). (17) 
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(b) The function g*(-) defined by J 16b is differentiable and nonincreasing in M++. Moreover, 
g*(t) < <?*. li m flo+£*W = S* = 0* flwrfx*(y*(f);f) is feasible to the original problem 
( |SepCPp . 

Proof We only prove d!7t . the proof of the remainders can be found in 191 11911 . Indeed, 
since g(y;-) is convex and differentiable and g j J ' = — F(x*(y;t)) < 0, we have g(y;t) > 

g (y;t) + {i-t) d -^l=g{y;t)-{i-t)F(x*(y;t)). D 

The statement (b) of Lemma[2]shows that if we find an approximate solution y k for d 1 6b 
for sufficiently small tjc, then g*(tk) approximates g* (recall that g* = <j)*) and x*(y k ;tk) is 
approximately feasible to JSepCPl. 



4 Path-following gradient method 

In this section we design a path-following gradient algorithm to solve the dual problem |Q}, 
analyze the convergence of the algorithm and estimate the local convergence rate. 



4. 1 The path-following gradient scheme 

Since g(-;t) is strictly convex and smooth, we can write the optimality condition of J 16b as: 

Vg{y;t) = 0. (18) 

This equation has a unique solution y* (t). 

Now, for any given x 6 int(X), VF(x) is positive definite. We introduce a local matrix 
norm: 

:= ||AV 2 FM-'A r || 2 , (19) 



The following lemma shows a main property of the function g(-;t). 

Lemma 3 Suppose that Assumptions AJ7] and A$2\ are satisfied. Then, for all t > and 
y,yE R m , one has: 



[v 8 (r,t)-v 8m n y -y)> '^-wf * 1 

CA[CA + \\vg(y,t)-Vg(y,t)\\2\ 



(20) 



where ca '■— \\\A\\\**( . f \. Consequently, it holds that: 

g(y;t)<g(y;t)+Vg(y;t) T (y-y)+tCO*(c A t- l \\y-y\\ 2 ), (21) 

provided that C& \\y — y||2 < t. 

Proof For notational simplicity, we denote by x* := x*(y;t) and x* := x*{y;t). From the 
definition ill It of Vg(-;r) and the Cauchy-Schwarz inequality we have: 

\Vg{y\t)-Vg(y;t)] T (y-y) = {y-y) T A(x* -f). (22) 

\\Vg(yU)-Vg(y;t)\\ 2 <\\\A\\\: t \\x*-x*\\ x ,. (23) 
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It follows from $M that A T (y-y) = t[VF(x*) - VF(x*} - [§(**) -|(Jf )], where £,(■) e 
d<f>(-).By multiplying this relation with x* — x* and then using 1131 Theorem 4. 1 .7] and the 
concavity of (j) we obtain: 

( y -yfA(x*-x*) = t{VF(x*) - VF(x*)] T (x* -x*) - [§(*•) - £(£*)] V -x*) 

concavity of 

> r[VF(x*)-VF(i*)] 7 '(x*-i*) 

> t\\x*-x*\\l 



l + ||x*-x*||. v . 
E) r[||Vg(y;0-V g (y;Q|[ 2 ] 2 

" \\\A\\\*,[\\\A\\\*, + \\S7 g (y,t)-Vg(y;t)\\ 2 Y 

Substituting this inequality into l !22b we obtain d20t . 

By the Cauchy-Schwarz inequality, it follows from ( 120b that ||Vg(y;j) — Vg(y;r)|| < 

f-^c I I'- I I ' P rov ided that c^Hf — y|j < t. Finally, by using the mean-value theorem, we have: 

g(y;t) = g{y,t) + Vg{y,t) T (y-y)+ [ {Vg{y + s{y-y);t)-Vg(y,t)) T (y-y)ds 

Jo 

<g(yu) + Vg(y;t) T (y-y) + c A \\ y -y\\ 2 f - CA4 \~ y \ ds 

Jo t - c A s\\y - y\\ 2 
= g{y,t) +Vg{y;t) T (y-y) +ta*{c A r l \\y -y\\ 2 ), 

which is indeed d21b provided that ca \\$— y\\i < t. □ 

Now, we describe one step of the path-following gradient method for solving d 16b . Let 
us assume that y 6 W and t > are the values at the current iteration, the values y + and t + 
at the next iteration are computed as: 

t+ :—t-At, 

(24) 

y+ := y -aVg(y,t + ), 

where a := Oc(y;t) > is the current step size and At is the decrement of the parameter t. In 
order to analyze the convergence of the scheme (124) . we introduce the following notation: 

x\ :=x*(y;t+), c Ai = \\\A\\\;. t [y . l+) and Aj := \\Vg(y;t+)\\ 2 . (25) 

First, we prove an important property of the path-following gradient scheme ( 124b . 
Lemma 4 Under Assumptions A^and A^\ the following inequality holds: 

g{,y+\t+)<g{y,t)-[aXl-t + a*{c M tl l aXy)-AtF{x\)}, (26) 

where c A \ and X\ are defined by ( 125b . 
Proof Since t + = t — At, by using 017b with t and t + , we have: 

g(r,t+) < g(y,t)+AtF(x*{y;t+)). (27) 

Next, by d21b we have y+—y = —CcVg(y,t + ) and Ai := || Vg(y;t+)\\2- Hence, we can derive: 

g{y+U+) < g{y,t+)- atf + t + w*(c A1 a?i 1 t- 1 ). (28) 

By plugging d27b into d28b , we obtain d26b . □ 



Path-Following Gradient-Based Decomposition Algorithms For Separable Convex Optimization 



Lemma 5 For any y 6 W" and t > 0, the constant c A '■= |||A|||*»/ ,.. ■. is bounded. More 
precisely, c A < c A := k\\\A\\\* l - < +°°. Furthermore, A := ||Vg(y;f)||2 is also bounded, i.e.: 

A < A := k\\\A\\\*, + \\Ax c -b\\ 2 , where K := l£ 1 [v,- + 2 v %]. ' 

Proof For any x G int(X), from the definition of ||| • |||*, we have: 

|||A|||* = sup{[v r AV 2 J F(*)- 1 A r v] 1 / 2 : ||v|| 2 = l} 
= sup{H* : u=A T v, ||v|| 2 = l} 
<sup{(v + 2 A /v)||«|& : u=A T v, ||v|| 2 = 1} 

= (v + 2 V / V)sup{[v r AV 2 F(^r 1 A r v] 1/2 , ||v|| 2 = l} 

= (v + 2^)|||A|||^. 

Here, the inequality in this implication follows from |[T3l Corollary 4.2.1]. By substituting 
x = x* (y; t) into the above inequality, we obtain the first conclusion. In order to prove the 
second bound, we note that Vg(v; t ) = Ax* (y,t) — b. Therefore, by using ((7}, we can estimate: 

||V*Cy;0ll2 = \\Ax*(y;t)-b\\ 2 < \\A(x*(y;t) -x c )\\ 2 + \\Ax c - b\\ 2 
<\\\A\\\:4x*(y;t)-x%, + \\Ax c -b\\ 2 

<K\\\A\\\:, + \\Ax c -b\\ 2 , 

which is the second conclusion. □ 

Next, we show how to choose the step size a and the decrement At such that g(y+ ;t+) < 
g(y\t) inLemma|4] We note thatx*(y;t+) is obtained by solving the subproblem l(9} and the 
quantity cy := F(x*(y;t+)) is nonnegative and computable. By Lemma[5] we see that: 

a W := i \i \ * ffl>(0 : = - 7^-tt . (29) 

cai(cai + M) C A (c a + A) 

which shows that a(t) is bounded away from zero. We have the following estimate. 
Lemma 6 The step size cc(t) defined by d29l > satisfies: 

g(y+;t + )<g(y;t)-t + w(^\+AtF(xl). (30) 

Proof Let <p{a) := aAj 2 — t+(D*(cA\t+ ttk\) — f+0)(Aic7f). We can simplify this function 
as (p (a) = t + [u + \n(l — «)], where u :=£7_ A^0£+f7_ ca\A.\<x — c7f Ai. The function <p(a) < 
for all u and (p(a) = at u = which leads to a := — , - , , , . D 

Since t + = t -At, if we choose A? := ^gM_ then: 

§(}' + ;f + )<g(y;0-^a)(c A1 1 A 1 ). (31) 

Therefore, the update rule for f can be written as: 

r+:=(l-CT)f, where a := , M 6(0,1). (32) 

2[< B (c7 1 1 A 1 )+F(x*)] 
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4.2 The algorithm 

Combing the above analysis, we can describe the path-following gradient decomposition 

method is follows: 

Algorithm 1. (Path-following gradient decomposition). 

Initialization: 

1. Choose an initial value t o > and tolerances e, > and e g > 0. 

2. Take an initial pointy € M!" and solve (3]l in parallel to obtain jcq :—x*(y ,to). 

3. Compute c° := |||A|||*», Ao := ||Vg(y ;fo)|| 2 , Wo ■= »(Vc°) and c° F := >(**). 



Iteration: For fc = 0, 1, • - - , perform the following steps: 

Step 1: Update the barrier parameter: £# + i :=ffc(l — Ojfc), where Cfy 



(0,, 



2(0) t +4)' 

5Ye/? 2: Solve l[3) ;'« parallel to obtain .*£ := ;c* (y* , fy + 1 ) . Then, form the gradient vector 
Vg(y k ;t k+l ):=Axl-b. 
Step 3: Compute X k+X := ||Vg(/;^ +1 )|| 2 , c* +1 := |||A|||*», a k+l := a{X k+l /c k + l ) and 

4 +1 :=F(4). 

Step 4: If t < Et and X k < e, then terminate. 

5fep 5: Compute the step size a k+ \ := t+i, / + V • 

C A i c A +4+1 ) 

5fep 6: Update y k+l as: 

/ +1 :=/-^+iVg(/,f, +1 ). 

End. 

The main step of Algorithm [T] is Step 2, where we need to solve in parallel the primal 
subproblems. To form the gradient vector Vg(-,t k+i ), one can compute in parallel by mul- 
tiplying column-blocks A,- of A by the solution x*(y k ,t k+ \). This task only requires local 
information to be exchanged between the current node and its neighbors. 

From the update rale (I32t of t we can see that a k — > + as F(x* k ) —> °°. This happens 
when the barrier function is approaching the boundary of the feasible set X. Hence, the 
parameter t is not decreased. Let Cf be a sufficiently large positive constant. We can modify 
the update rule of t as: 

\t k (l- -, mk A -J ifci<c F , 
t k+l :={ U ^W)> F - (33) 

yt k otherwise, 

In this case, the sequence {t k } generated by Algorithm [T]might not converge to zero. More- 
over, the step size a k computed at Step 5 depends on the parameter t k . If t k is small then 
Algorithm[T]makes short steps toward a solution of ([!}. 



4.3 Convergence analysis 

Let us assume that t_ = mf k> ot k > 0. Then, the following theorem shows the convergence of 
Algorithm Q] 

Theorem 1 Suppose that Assumptions A\l\and A$2\are satisfied. Suppose further that the 
sequence {(y ,t k ,h k )} k> Q generated by Algorithm\l\satisfies t_ := infj.>o{?t} > 0. Then: 

lim||Vg(/,f /t+1 )||2=0. (34) 

Consequently, there exists a limit point y* of {y } such that y* is a solution of d 1 6b at t = t_. 
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Proof It is sufficient to prove d34t . Indeed, from ( Bit we have: 

£!©(W<i +1 ) <g(y°;to)-g(y k+1 ;t k+ i) <g(y°;t )-g*. 

(=0 L 
Since t k > t_ > and c A +l < ca due to Lemma|5] the above inequality leads to: 

f E ®(^+i/^) < g(y°;t ) -g* < +°°. 

This inequality implies lim^^oc a>(A^ + i /ca) = 0, which leads to lim^^co/i/t+i = 0. By defi- 
nition of Xk we have lim^oo 1 1 Vg (y* ; fy- + 1 ) 1 1 2 = 0. D 

Remark 3 From the proof of Theorem [T] we can fix c A = c := K"|||A|||* t in Algorithm[T] This 
value can be computed a priori. 

4.4 Local convergence rate 

Let us analyze the local convergence rate of Algorithm [TJ Let y° be an initial point of Algo- 
rithm[T]and y*(t) be the unique solution of d 1 6b . We denote by: 

r (t):=\\y -f(t)\\ 2 . (35) 

For simplicity of discussion, we assume that the smoothness parameter tk is fixed at t_ > 
sufficiently small for all k > (see Lemma[T]l. The convergence rate of Algorithm [T]in the 
case fy = t_ is stated in the following lemma. 



Lemma 7 (Local convergence rate) Suppose that the initial point y is chosen such that 

g^tJ-g^tjK-^.Then: 

8 (At)- g *(t)< J^f? (36) 

\bcAro{t_)-\-itk 

Consequently, the local convergence rate of Algorithm\l\is at least O I — A ° k ). 

Proof Let A k := g(y k ;f) — g*{t) and y* := y*(f). Then A^ > 0. First, by the convexity of 
g(-',t) we have: 

^ = g(/;£)-/(r)<||Vg(/ ! f)|| 2 ||/- z *|| 2 = A jt |b -/|| 2 <^ a). 

This inequality implies: 

A* >rfc (*)"%■ (37) 

Since fy = t_ > is fixed for all k > 0, it follows from l !26b that: 

g(y k+1 u_)<g(y k ;t)-tMh/&), 

where Xj, := ||Vg(3r;f)|| 2 and c A := |||A|||* ,. ,. By using the definition of A k , the last in- 
equality is equivalent to: 

A k+l <A k -m{hlA)- (38) 
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Next, since ©(t) > T 2 /2 - T 3 /3 > T 2 /4 for all < X < 3/4 and c A < c A due to Lemma|5] 
it follows from OUl and (|38} that: 



Ak+\ <A- (t_A{)/(4r Q (t_) z c z A ), (39) 



for all A k < 3c A r {t_)/4. 



Let J] := t_/(4r Q (t_) 2 c 2 l ). Since z\* > 0, (|39]> implies: 



1 1 1 71 1 

> = 1 ! > 1- J?. 

A k+l A k {\-r\A k ) A k (\-r\A k ) A k 



4o 



By induction, this inequality leads to -£- > 4- + T\k which is equivalent to A k < y—— 
provided that Aq < 3c A ro(t)/4. Since r\ := t/{4ro(t) 2 c A y ), this inequality is indeed d36b - The 
last conclusion follows from d36t . D 



5 Fast gradient decomposition algorithm 

Let us fix t = t > 0. The function g(-;t_ ) is convex and differentiable but its gradient is not 
Lipschitz continuous, we can not apply Nesterov's fast gradient algorithm 1131 to solve ( 116t . 
In this section, we modify Nesterov's fast gradient method in order to obtain an accelerating 
gradient method for solving (1161 . 

One step of the modified fast gradient method is described as follows. Let y and v be 
given points in e K m , we compute new points y + and v+ as follows: 

y+ :=v-aVg(v;t_), (4Q) 

where a : = tj (c A (c A + A ) ) is the step size, jSi , p\ and JS3 are three parameters which will be 
chosen appropriately. First, we prove the following estimate. 

Lemma 8 Let 6 6 (0, 1) be a given parameter and p := t /{2Qc^). We define two vectors: 

r := 0- 1 [v- (l-0))>] and r+ = r-pVg(v;t_). (41) 

Then the new point y + generated by ( 140b satisfies: 

e- 2 (*Cyv.i)-£*)+r 1 dlk+-/||l<e- 2 (i-e)U(y:i)-£*)+r 1 dlk-z*||i, (42) 

provided that ||Vg(v;f)||2 < 3c A /4, where y* :=y*(t_) and g* := g{y*;t). 
Proof Since y + — v — aVg(y;t) and a = - if +u ' ^ f°ll° ws from d21b that: 

g(y+;t) < g(v;f) -t»(||Vg(v;t)|| 2 /2A). (43) 

Now, since 0)(t) > T 2 /4 for all < T < 3/4, the inequality d43l implies: 

g(?+;t) <g(v;f)- ^-11 Vg(v;0||2, (44) 
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provided that ||Vg(v;f)||2 < 3ca/4. For any u := (1 - 8)y+ 8y* and 6 G (0, 1) we have: 

g(v;t_)<g( U ;i)+Vg(vj_) T (v-u)<(l-e)g(y,L) + eg(f;t_) 

+ Vg{v;t_) T (v-{l-e)y-ef). (45) 

By substituting l !45b and the relation v — ( 1 — 8)y = 6r into d44b we obtain: 

8(yVJj<(l-0)g(y;t_) + eg_* + dVg(vu_) T (r-f)-pT\\Vg(v;t)\\ 2 2 

e 2 c 2 A r„ .,., „ t 



(l-e)g(y; L ) + 9g* + ^\\\r-y*f 2 -\\r- -^VgivjJ-fW 2 ] 

(i-e)^(y;f)+e£*+^[ll'--/||i-||r + -/||i]. (46) 



Since 1/0- = (1 - 0)/0 2 + 1/0, by rearranging (g6j we obtain <g2). D 

Next, we consider the update rale of 8. We can see from d42b that if 0+ is updated such that 
(1 — d+)/d+ = 1/0 then g(;y+;£) < g(y;£). The above condition implies: 



e + = o.56(ve 2 +4-e). 

The following lemma provides an estimate for k > 0. 

Lemma 9 The sequence {8k} k >o generated by 8k+\ := 0.5%[(9 A 2 +4) 1 / 2 — 8k] and do = 1 
satisfies: 

—-— < 0* < -?-, Mk > 0. (47) 

2fc+ 1 - - fc + 2 

Proof We note that 0i.ii = . If we define si- :=2/8i- then the last relation implies 

which leads to — ^ < — ^- < -^-r. Hence, Sj + 1 < s^+l < ** + 2. By 



2 



s t+l /j 2 +l + l ' ' 5 *+ 2 s *+l ' 5 * +1 ' 

induction, we have so + k < Sk < so + 2k. More over, do = 1, we have so = 2. Substituting 
sq = 2 into the last inequalities and then using the relation Sk = 2/ 9k we obtain ( 147b . □ 

By Lemma[8l we have r + = r — pVg(v;f) and r + = 6+ l (v + — (1 — + )y + ). From these 
relations, we deduce that: 

v + = (l-0 + )y+ + + (r-pVg(v;f)). (48) 

Note that if we combine l !48b and ( 140b then: 

v+ = (1- d+ - a- l p8 + ) y+ - 8~ x 8+{\ - 0)y + (0 -1 +a~ l p) 6+v. 

This is the second line of l@0]l, where J3i := 1 - 0+ -pS+a" 1 , jS 2 := -(1 - 0)0+0~' and 
JS3 := (0 _1 + pa~')0 + . By combining all the above analysis, we can describe the modified 
fast gradient algorithm in detail as follows: 
Algorithm 2. (Fast gradient decomposition algorithm). 
Initialization: Perform the following steps: 

1 . Given a tolerance e > 0. Fix the parameter / at a certain value t_ > 0. 

2. Find an initial point y° £ W" such that Ao := ||Vg(y°;f)|| 2 < ^J 1 . 

3. Set0o:=landv°:=y o . 
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Iteration: For k — 0, 1, ■ • • , perform the following steps: 

Step 1: If X k < e then terminate. 

5?ep 2: Compute r k := flf 1 [v A ' - (1 - 0*)/]. 

Sfe/? J: Update y* +1 as: 

/+i := v i '-a,Vg(v A -,£) 



where at = - ,_-. * r . 

Step 4: Update 8 k+1 := ±0,[(02 + 4) 1 / 2 - %]. 
Step 5: Update 

v k+1 := (1 - 6 k+1 )y k+1 + e k+1 (r k - Pk V g (v k )), 

where ^ := 2ifv 

Step 6: Solve l|9]l in parallel to obtain x| +1 := x*{v k+l ;f). Then form a gradient vector 

Vg(v* +1 ;f) :=A*| +1 -fc and compute A*+i := || V^(v' fc+1 ; f ) || 2 - 

End. 

The core step of Algorithm [2] is Step 6, where we need to solve M primal subproblems 
in parallel. Algorithm |2]differs from Nesterov's fast gradient algorithm 1131 at Step 5, where 
Vjt_|_i not only depends on y + and v* but also on Vg(ir;t). 

The following theorem shows the convergence of Algorithm [2] 

Theorem 2 Let y° e R m be an initial point of Algorithm^ such that ||Vg(y°;f)||2 < ^p. 
Then the sequence {(y , v )}/t>o generated by Algorithm\2\satisfies: 

g{y k ;t)-g*{t_) < j0^\\y°-y*(i)\\ 2 - (49) 

Proof From d42b and the update rale of 9 k , we have: 

^W +1 ;£)-£*)+^rV +1 -3?ll2< ^ 2 (i -%)C*(y*;i)-£*)+drV-)?ll2 

<Ci(*Cy k ;0-£*)+d£ _1 ll'*-y*ll2 

By induction, we obtain from this inequality that: 

e^M^D-g*) < OoW;t)-g;)+ZAt- 1 \\r 1 -y*\\ 2 2 

<(l-eo)e o 2 (g(y ;t_)-g_*)+c 2 Al - l \\r -f\\i 

for k > 1 . Since 0o = 1 and y = v , we have r° = y° and the last inequality implies g (y k ; t_ ) — 
8* < A Q l-\C X \\y° - ylll- Since e *-i < ITT due t0 Lemma|9] we obtain @3). □ 



Let us denote by: 



-- ^ Ca 1 



#(&;*) := \f e r " I II W;*)ll2 <~t)- ( 5 °) 

It is obvious that y*(t) € .^"(cU;?). This set is a neighbourhood of the solution y*(t) of the 
problem dT6b . 

Remark 4 Let £ > be a given accuracy. If we fix the barrier parameter t_ := e then the 
worst-case complexity of Algorithm [2] in the neighbourhood Si{c A \t) is O( ~ t ^ £o ), where 
£o:=r (£). 
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Remark 5 (Switching strategy) We can combine Algorithms [T] and [2] to obtain a switching 
variant: 

- First, we apply Algorithm Q] to find a point y° e R m and t_ > such that |j Vg(f°;f)||2 < 

4 • 

- Then, we switch to use Algorithm [2] 

We can also replace the constant ca in Algorithm [2] by any upper bound ca of c^. For in- 
stance, we can choose ca '■= max {«U,4||Vg(y ,f)||2/3|. 



6 Numerical tests 

In this section, we test the switching variant of Algorithms [T] and [2] proposed in Remark [5] 
which we name by PFGDA for solving the following convex programming problem: 



min 7||x|| i+/(*) 

S.t. 



(51) 



where /(*):= I" =1 fi(xt), and /; : K -+ K is a convex function, A e W x ", b € W and 
/, m e R" such that / < < u. 

We note that the feasible set X := [/, u] can be decomposed into n intervals X, := [/,-, m,-] 
and each interval is endowed with a 2-self concordant barrier Fj(xi) := — ln(jc,- — /,■) — ln(M,- — 
x;) +21n((i// — /;)/2) for j = l,...,n. Moreover, if we define (j)(x) := — E" =1 [/i(x,-) + 7|jc;|] 
then is concave and separable. Problem d5U can be reformulated equivalently to ( |SepCP| . 

The smoothed dual function components gi(y;t) of d5 1 b can be written as: 

gi(y;t)= max {-f i (x i )-y\xi\ + (Ajy)x i -tF i (x i )}-b T y/n, 

lj<Xj<Uj 

for i = 1, . . . ,n. This one- variable minimization problem is nonsmooth but it can be solved 
easily. In particular, if /; is affine then this problem can be solved in a closed form. In case 
/, is smooth, we can reformulate (15 1) into a smooth convex program by adding n slack 
variables and 2w additional inequality constraints to handle the ||jc||i part. 

We have implemented PFGDA in C++ running on a 16 cores Intel ®Xeon 2.7GHz work- 
station with 12 GB of RAM. The algorithm was parallelized by using QpenMP. We termi- 
nated PFGDA if: 

optim := ||Vg(yV,)|| 2 /max{l,||Vs(y ;fo)|| 2 } < KT 3 and t k < 1(T 2 . 

We have also implemented two algorithms, namely decomposition algorithm with two pri- 
mal steps 1201 Algorithm 1] and decomposition algorithm with two dual steps in 1191 Algo- 
rithm 1] which we named 2pDecompAlg and 2dDecompAlg, respectively, for solving prob- 
lem d51b and compared them with PFGDA. We terminated 2pDecompAlg and 2dDecompAlg 
by using the same conditions as in I19U20I with the tolerances £f eas = £f un = £ bj = 10 -3 
and j m . dx = 3. We also terminated all three algorithms if the maximum number of iterations 
maxiter := 10,000 was reached. In the last case we clarify that the algorithm is failed. 

a. Basis pursuit problem. If the function f(x) = for all x then problem d51t becomes a 
bound constrained basis pursuit problem to recover the sparse coefficient vector .* of given 
signals based on a transform operator A and a vector of observations b. We assume that 
A 6 R mx ", b 6 R m and x G W, where m < n and x has k nonzero elements (k <§C n). 
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In this case, we only illustrate PFGDA by applying it to solve some small size test prob- 
lems. In order to generate a test problem, we generate a random orthogonal matrix A and a 
random vector xq which has k nonzero elements. Then we define vector b as b :=Axq. 

We test PFGDA on the four problems such that [m,n,k] are [50, 128, 14], [100,256,20], 
[200, 512, 30] and [500, 1024, 50]. The results reported by PFGDA are plotted in FigureQ] 




Fig. 1 Illustration of PFGDA via the basis pursuit problem 

As we can see from these figures that the vector of recovered coefficients x matches very 
well the vector of original coefficients xq in these four problems. Moreover, PFGDA requires 
376, 334, 297 and 332 iterations, respectively in the four problems. 

b. Nonlinear separable convex problems. In order to test the performance of PFGDA, we 
generate in this case a large test-set of problems and compare the performance of PFGDA 
with 2pDecompAlg and 2dDecompAlg. 

The test problems were generated as follows. We chose the objective function //(-*'/) := 
e~ 7iXi — 1, where y > is a given parameter for i = 1,. .. ,n. Matrix A was generated ran- 
domly in [—1, 1] and then was normalized byA/||A||«o. We generated a sparse vector xq ran- 
domly in [—2, 2] with the density 2% and defined a vector b := Ax. Vector y := (yi , • • • , y n ) T 
was sparse and generated randomly in [0,0.5]. The lower bound /, and the upper bounds m, 
were set to —3 and 3, respectively for all i = 1 , . . . , n. 

We benchmarked three algorithms with performance profiles |@). Recall that a perfor- 
mance profile is built based on a set 5? of n s algorithms (solvers) and a collection S* of 
n p problems. Suppose that we build a profile based on computational time. We denote by 
T ps := computational time required to solve problem p by solver s. We compare the perfor- 
mance of algorithm s on problem p with the best performance of any algorithm on this 

T 

problem; that is we compute the performance ratio r PvS := min i T p 1 ?e ^. ■ Now, let p s (T) := 
^-size{/? € £? | r PtS < t} for T € K+. The function p s : R — > [0, 1] is the probability for 
solver s that a performance ratio is within a factor % of the best possible ratio. We use the 
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term "performance profile" for the distribution function p s of a performance metric. We plot- 
ted the performance profiles in log-scale, i.e. p s (x) := ^-size {p 6 3? Xogjijp.s) ^ 1°S2 T }- 
We tested three algorithms on a collection of 50 random problems with m from 200 to 
1 , 500 and n from 1 , 000 to 15, 000. The profiles are plotted in Figure[2] Based on this test, we 
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Fig. 2 Performance profiles in log, scale of three algorithms. 
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can make the following observations. Both algorithms, PDGDA and 2dDecompAlg, can solve 
all the test problems, while 2pDecompAlg can only solve 46/50 (92%) problems. PFGDA 
requires a significantly fewer iterations than 2pDecompAlg and 2dDecompAlg, and it has 
the best performance on 100% problems in terms of number of iterations. 2dDecompAlg 
is the best in terms of computational time where it reaches 100% the test problem with 
the best performance. However, the number of nonzero elements of the obtained solution 
in PFGDA matches very well the vector of original coefficients xq, while it is rather bad 
in 2pDecompAlg and 2dDecompAlg as we can see from the last figure. In other words, 
2dDecompAlg is not good at finding a sparse solution in this example. 



7 Concluding remarks 



In this paper we have proposed two new dual gradient-based decomposition algorithms for 
solving large-scale separable convex optimization problems. We have analyzed the conver- 
gence of these to schemes and derived the rate of convergence. The first property of these 
methods is that they can handle general convex objective functions. Therefore, they can be 
applied to a wide range of applications compared to second order methods. Second, the new 
algorithms can implemented in parallel and all the algorithmic parameters are updated au- 
tomatically without using any tuning strategy. Third, the convergence rate of Algorithm |2]is 
0(1 /k) which is optimal in the dual decomposition framework. Finally, the complexity esti- 
mates of the algorithms do not depend on the diameter of the feasible set as in proximal-type 
methods, they only depend on the parameter of the barrier functions. 
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